NSF PAR Search | NSF Public Access Repository


Machine Learning in Small-Molecule Mass Spectrometry

https://doi.org/10.1146/annurev-anchem-071224-082157

Hong, Yuhui; Ye, Yuzhen; Tang, Haixu (May 2025, Annual Review of Analytical Chemistry)

Tandem mass spectrometry (MS/MS) is crucial for small-molecule analysis; however, traditional computational methods are limited by incomplete reference libraries and complex data processing. Machine learning (ML) is transforming small-molecule mass spectrometry in three key directions: (a) predicting MS/MS spectra and related physicochemical properties to expand reference libraries, (b) improving spectral matching through automated pattern extraction, and (c) predicting molecular structures of compounds directly from their MS/MS spectra. We review ML approaches for molecular representations [descriptors, simplified molecular-input line-entry system (SMILES) strings, and graphs] and MS/MS spectra representations (using binned vectors and peak lists) along with recent advances in the prediction of spectra, retention times, and collision cross sections, and in spectral matching. Finally, we discuss ML-integrated workflows for chemical formula identification. By addressing the limitations of current methods for compound identification, these ML approaches can greatly enhance the understanding of biological processes and the development of diagnostic and therapeutic tools.
Free, publicly-accessible full text available May 15, 2026
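
The review abstract above (Hong, Ye, and Tang) mentions two MS/MS spectrum representations, peak lists and binned vectors. The following is a minimal sketch of converting one into the other; it is not code from the review. The function name `bin_spectrum`, the 0 to 1000 m/z range, the 1.0 bin width, and the toy peaks are all illustrative assumptions.

```python
import math


def bin_spectrum(peaks, mz_min=0.0, mz_max=1000.0, bin_width=1.0):
    """Convert a peak list [(mz, intensity), ...] into a fixed-length vector.

    Intensities of peaks that fall into the same bin are summed, and the
    result is scaled so that the largest bin equals 1.0. The range and bin
    width are hypothetical defaults, not values taken from the review.
    """
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # Clamp guards against floating-point rounding at the upper edge.
            idx = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vector[idx] += intensity
    top = max(vector, default=0.0)
    if top > 0:
        vector = [v / top for v in vector]
    return vector


# Made-up peak list for illustration; the last peak lies outside the range.
peaks = [(100.2, 50.0), (100.7, 25.0), (250.4, 150.0), (1200.0, 10.0)]
binned = bin_spectrum(peaks)
```

A binned vector has the same length for every spectrum, which suits models that expect fixed-size input, at the cost of m/z resolution; a peak list keeps exact m/z values but varies in length.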
Protein domain embeddings for fast and accurate similarity search

https://doi.org/10.1101/gr.279127.124

Iovino, Benjamin Giovanni; Tang, Haixu; Ye, Yuzhen (September 2024, Genome Research)

Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues or by applying matrix transformation techniques such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found: for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins with single domains but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut for short) in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark show that DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
Full Text Available
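
The DCTdomain abstract above describes applying the DCT to the residue embeddings of a domain to obtain a fixed-size vector. The sketch below illustrates only that general idea and is not the DCTdomain implementation: it applies an unnormalized one-dimensional DCT-II along the residue axis and keeps the first few low-frequency coefficients per embedding dimension. The function names, the `n_coeffs` parameter, and the toy two-dimensional embeddings are assumptions made for illustration, and the input is assumed to be non-empty.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II of a 1D sequence of floats."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_fingerprint(residue_embeddings, n_coeffs=3):
    """Compress an L x D matrix of residue embeddings into n_coeffs * D values.

    The DCT runs along the residue axis for each embedding dimension, and
    only the first n_coeffs (lowest-frequency) coefficients are kept, so
    domains of different lengths map to vectors of the same size.
    """
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    fingerprint = []
    for d in range(dim):
        column = [residue_embeddings[i][d] for i in range(length)]
        coeffs = dct_ii(column)[:n_coeffs]
        coeffs += [0.0] * (n_coeffs - len(coeffs))  # pad very short domains
        fingerprint.extend(coeffs)
    return fingerprint


# Toy embeddings (made up): 4 and 6 residues, 2 dimensions each.
short_domain = [[0.1, 0.9], [0.4, 0.7], [0.3, 0.2], [0.8, 0.1]]
long_domain = short_domain + [[0.5, 0.5], [0.2, 0.6]]
fp_short = dct_fingerprint(short_domain)
fp_long = dct_fingerprint(long_domain)
```

Both fingerprints have six values even though the domains differ in length, which is the property that makes fixed-size vectors convenient for fast similarity search.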
SpecEncoder: deep metric learning for accurate peptide identification in proteomics

https://doi.org/10.1093/bioinformatics/btae220

Liu, Kaiyuan; Tao, Chenghua; Ye, Yuzhen; Tang, Haixu (June 2024, Bioinformatics)

Motivation: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra; however, both may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification. Results: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%–2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%–15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%–12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides than deep-learning-enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder’s potential to enhance peptide identification for proteomic data analyses. Availability and implementation: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.
Enhanced Structure-Based Prediction of Chiral Stationary Phases for Chromatographic Enantioseparation from 3D Molecular Conformations

https://doi.org/10.1021/acs.analchem.3c04028

Hong, Yuhui; Welch, Christopher J; Piras, Patrick; Tang, Haixu (February 2024, Analytical Chemistry)

The accurate prediction of suitable chiral stationary phases (CSPs) for resolving the enantiomers of a given compound poses a significant challenge in chiral chromatography. Previous attempts at developing machine learning models for structure-based CSP prediction have primarily relied on 1D SMILES strings (SMILES, the simplified molecular-input line-entry system, is a line notation that describes the structure of chemical species using short ASCII strings) or 2D graphical representations of molecular structures, and have met with only limited success. In this study, we apply the recently developed 3D molecular conformation representation learning algorithm, which uses rapid conformational analysis and point clouds of atom positions in 3D space, enabling efficient chemical structure-based machine learning. By harnessing the power of the rapid 3D molecular representation learning and a dataset comprising over 300,000 chromatographic enantioseparation records sourced from the literature, our models afford notable improvements for the chemical structure-based choice of appropriate CSP for enantioseparation, paving the way for more efficient and informed decision-making in the field of chiral chromatography.
Full Text Available
Accurate de novo peptide sequencing using fully convolutional neural networks

https://doi.org/10.1038/s41467-023-43010-x

Liu, Kaiyuan; Ye, Yuzhen; Li, Sujun; Tang, Haixu (December 2023, Nature Communications)

De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides us with a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we present PepNet, a fully convolutional neural network for high-accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input, and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g. PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, thus being more suitable for the analysis of large-scale proteomics data.
Functional profile of host microbiome indicates Clostridioides difficile infection

https://doi.org/10.1080/19490976.2022.2135963

Nzabarushimana, Etienne; Tang, Haixu (December 2022, Gut Microbes)

Full Text Available
Sherlock on Specs: Building LTE Conformance Tests through Automated Reasoning

Chen, Yi; Tang, Di; Yao, Yepeng; Zha, Mingming; Wang, Xiaofeng; Liu, Xiaozhong; Tang, Haixu; Liu, Baoxu (August 2023, 32nd USENIX Security Symposium)

Full Text Available
3DMolMS: prediction of tandem mass spectra from 3D molecular conformations

https://doi.org/10.1093/bioinformatics/btad354

Hong, Yuhui; Li, Sujun; Welch, Christopher J.; Tichy, Shane; Ye, Yuzhen; Tang, Haixu; Elofsson, Arne (ed.) (May 2023, Bioinformatics)

Motivation: Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from their MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods were proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds’ 3D conformations, and thus neglected critical structural information. Results: We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The results showed that 3DMolMS predicted the spectra with average cosine similarities of 0.691 and 0.478 with the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as the elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification. Availability and implementation: The code of 3DMolMS is available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.
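
The 3DMolMS abstract reports cosine similarity between predicted and experimental MS/MS spectra. The sketch below shows one common way to compute such a score from two peak lists; the binning scheme, the 1.0 bin width, the function name, and the toy peaks are assumptions for illustration and are not taken from the 3DMolMS code.

```python
import math


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity of two peak lists [(mz, intensity), ...].

    Peaks are grouped into m/z bins of width bin_width (a hypothetical
    choice), and the cosine is taken between the binned intensity vectors.
    """

    def to_bins(peaks):
        bins = {}
        for mz, intensity in peaks:
            key = int(mz / bin_width)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins

    a, b = to_bins(peaks_a), to_bins(peaks_b)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as matching nothing
    return dot / (norm_a * norm_b)


# Made-up spectra for illustration only.
predicted = [(91.05, 100.0), (119.08, 40.0), (147.04, 10.0)]
experimental = [(91.06, 90.0), (119.09, 50.0), (165.10, 20.0)]
score = cosine_similarity(predicted, experimental)
identical = cosine_similarity(predicted, predicted)
```

With non-negative intensities the score lies between 0 (no shared bins) and 1 (identical intensity pattern), so the reported averages of 0.691 and 0.478 can be read on that scale.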
Locality-Sensitive Hashing-Based k-Mer Clustering for Identification of Differential Microbial Markers Related to Host Phenotype

https://doi.org/10.1089/cmb.2021.0640

Han, Wontack; Tang, Haixu; Ye, Yuzhen (July 2022, Journal of Computational Biology)

Full Text Available
The evolving privacy and security concerns for genomic data analysis and sharing as observed from the iDASH competition

https://doi.org/10.1093/jamia/ocac165

Kuo, Tsung-Ting; Jiang, Xiaoqian; Tang, Haixu; Wang, XiaoFeng; Harmanci, Arif; Kim, Miran; Post, Kai; Bu, Diyue; Bath, Tyler; Kim, Jihoon; et al (September 2022, Journal of the American Medical Informatics Association)

Concerns regarding inappropriate leakage of sensitive personal information as well as unauthorized data use are increasing with the growth of genomic data repositories. Therefore, privacy and security of genomic data have become increasingly important and need to be studied. With many proposed protection techniques, their applicability in support of biomedical research should be well understood. For this purpose, we have organized a community effort over the past 8 years through the Integrating Data for Analysis, Anonymization and Sharing (iDASH) consortium to address this practical challenge. In this article, we summarize our experience from these competitions, report lessons learned from the events in 2020/2021 as examples, and discuss potential future research directions in this emerging field.
Full Text Available
